Local n-grams for Author Identification Notebook for PAN at CLEF 2013

Authors

  • Robert Layton
  • Paul A. Watters
  • Richard Dazeley
Abstract

Our approach to the author identification task applies existing authorship attribution methods based on local n-grams (LNG) and combines them in a weighted ensemble. This approach placed third in this year’s competition, using a relatively simple scheme in which each model is weighted by its training set accuracy. LNG models create profiles, each consisting of a list of the character n-grams that best represent a particular author’s writing. The weighted ensemble improved the accuracy of the method without reducing the speed of the algorithm; the submitted solution was near the top of the leaderboard in terms of accuracy and was also one of the faster algorithms submitted.

The authorship identification task at PAN 2013 was a variation on the standard authorship analysis task of authorship attribution. In authorship identification, we have a training set of documents from the same author and a test document of unknown authorship. The task is to determine whether the author of the training documents also wrote the test document.

This task differs from authorship attribution in a few ways. First, we cannot simply take a ‘best guess’ by finding the best matching author. A match or no-match decision must be made, similar to the open set problem of authorship attribution, in which the actual author may not be in the candidate set. Second, we have no point of reference against which to compare the similarity of author to document. In other words, we cannot judge whether two profiles are similar relative to other candidates, and must therefore find algorithms that can decide in absolute terms whether two profiles match. Third, specifically for this task, the number of documents was small. Most problems in this task had just three documents from the same author, which reduces the ability to estimate variance.
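The profile and ensemble ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the LNG distance, the thresholds, or the ensemble members, so the function names (`char_ngrams`, `profile`, `dissimilarity`, `weighted_vote`), the common n-grams style relative distance used as a stand-in, and the 0.5 decision cut-off are all assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count the overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def profile(documents, n, size):
    """Build a profile: the `size` most frequent n-grams across
    `documents`, mapped to their relative frequencies."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.most_common(size)}


def dissimilarity(p, q):
    """Stand-in distance (assumption, not the LNG measure): a relative
    squared difference over the union of both profiles, scaled so that
    0.0 means identical and 1.0 means no n-grams in common."""
    grams = set(p) | set(q)
    if not grams:
        return 0.0
    score = 0.0
    for gram in grams:
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score / (4 * len(grams))


def weighted_vote(votes, weights):
    """Combine match / no-match votes, weighting each ensemble member
    by its (hypothetical) training set accuracy."""
    matched = sum(w for v, w in zip(votes, weights) if v)
    return matched / sum(weights) >= 0.5


# Usage: one ensemble member per hypothetical (n, threshold) setting.
known = ["the cat sat on the mat", "the cat ate the rat"]
unknown = "the cat sat on the hat"
members = [(2, 0.5), (3, 0.5)]   # hypothetical settings
weights = [0.8, 0.6]             # hypothetical training accuracies
votes = [
    dissimilarity(profile(known, n, 50), profile([unknown], n, 50)) < t
    for n, t in members
]
same_author = weighted_vote(votes, weights)
```

Each member makes an absolute match or no-match decision against its own threshold, which reflects the point in the abstract that no other candidate authors are available for a relative comparison.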

Similar resources

Vector Space Model and Overlap Metric for Author Identification Notebook for PAN at CLEF 2013

This paper describes our entry for the Author Identification task at PAN 2013. The Author Identification task was performed using a combination of Vector Space Model [1] (VSM) and Similarity Overlap Metric [3] (SOM) on the character n-grams extracted from the documents related to an author and the document of question. A combination of the VSM and SOM provided an overall F-measure, precision an...

A Basic Character N-gram Approach to Authorship Verification Notebook for PAN at CLEF 2013

This paper describes our approach to the Author Identification task in the PAN 2013 evaluation lab. We use a profile-based approach built on the common n-grams (CNG) method, which employs a normalized distance measure for short and unbalanced text introduced by Stamatatos [6]. We achieved 9th place with an overall F1 score of 0.6.

Grammar Checker Features for Author Identification and Author Profiling Notebook for PAN at CLEF 2013

Our work on author identification and author profiling is based on the question: Can the number and the types of grammatical errors serve as indicators for a specific author or a group of people? In order to detect the grammatical errors we base our approach on the output of the open-source library LanguageTool. In the case of the author identification we transform the problem into a statistica...

Proximity Based One-class Classification with Common N-Gram Dissimilarity for Authorship Verification Task Notebook for PAN at CLEF 2013

We describe our participation in the Author Identification task of the PAN 2013 competition. This competition task presents participants with a set of authorship verification problems. In each such problem, one is given a set of documents written by one author and a sample document; the task is to answer the question whether or not the sample document was written by the same author as the rem...

ITALICA at PAN 2013: An Ensemble Learning Approach to Author Profiling Notebook for PAN at CLEF 2013

This notebook discusses the approach to the Author Profiling task developed by the Italica group for PAN 2013. This system implements two different sets of classifiers which are combined later in order to build a final classifier that takes into account the decisions of the previous ones. The initial classifiers are focused on vector space representations of the documents as a bag of words and ...

Publication date: 2013